Queueing analysis of GPU-based inference servers with dynamic batching: A closed-form characterization

Authors

Abstract

GPU-accelerated computing is a key technology to realize high-speed inference servers using deep neural networks (DNNs). An important characteristic of GPU-based processing is that its computational efficiency, in terms of processing speed and energy consumption, drastically increases when multiple jobs are processed together as a batch. In this paper, we formulate the GPU-based inference server as a batch-service queueing model with batch-size-dependent service times. We first show that the efficiency of the server monotonically increases with the arrival rate of jobs, which suggests that it is energy-efficient to operate the server at a utilization level as high as possible within the latency requirement of jobs. We then derive a closed-form upper bound for the mean latency, which provides a simple characterization of the latency performance. Through simulation and numerical experiments, we show that the exact value of the mean latency is well approximated by this bound. We further compare the bound with latency curves measured on a real implementation and show that the observed performance is well explained by the derived formula.
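As a rough illustration of the model described above, the sketch below simulates a single-server batch-service queue in which the server, whenever it becomes free and jobs are waiting, serves all waiting jobs up to a maximum batch size together, with a service time that depends on the batch size. The Poisson arrivals, the cap of B = 32 jobs, and the affine service-time function s(b) = alpha + beta*b are illustrative assumptions rather than parameters from the paper, and the paper's closed-form upper bound is not reproduced here; the sketch only shows how the mean latency and mean batch size of such a model can be estimated by simulation.

```python
# Minimal sketch (not the paper's code): discrete-event simulation of a
# single-server batch-service queue with batch-size-dependent service times.
# Poisson arrivals, the maximum batch size B, and the affine service-time
# function s(b) = alpha + beta*b are illustrative assumptions only.
import random

def service_time(b, alpha=1.0, beta=0.1):
    # Hypothetical batch service time: a fixed overhead plus a small per-job
    # increment, so the per-job cost s(b)/b shrinks as the batch grows.
    return alpha + beta * b

def simulate(arrival_rate, B=32, n_jobs=200_000, seed=1):
    rng = random.Random(seed)
    t = 0.0                      # current simulation time
    server_free_at = 0.0         # time the server finishes its current batch
    queue = []                   # arrival times of jobs waiting to be batched
    latencies, batch_sizes = [], []
    next_arrival = rng.expovariate(arrival_rate)
    while len(latencies) < n_jobs:
        if queue and server_free_at <= next_arrival:
            # The server frees up before the next arrival: start a new batch
            # with everything currently waiting, capped at B jobs.
            start = max(t, server_free_at)
            batch, queue = queue[:B], queue[B:]
            done = start + service_time(len(batch))
            server_free_at, t = done, start
            batch_sizes.append(len(batch))
            latencies.extend(done - a for a in batch)
        else:
            # The next event is an arrival: enqueue the job.
            t = next_arrival
            queue.append(t)
            next_arrival = t + rng.expovariate(arrival_rate)
    return (sum(latencies) / len(latencies),
            sum(batch_sizes) / len(batch_sizes))

if __name__ == "__main__":
    # With these assumed parameters s(32) = 4.2, so the server sustains at
    # most about 32 / 4.2 ~= 7.6 jobs per unit time.
    for lam in (1.0, 4.0, 7.0):
        mean_latency, mean_batch = simulate(lam)
        print(f"rate {lam:3.1f}: mean latency {mean_latency:7.3f}, "
              f"mean batch size {mean_batch:5.2f}")
```

Under these assumptions the printed mean batch size grows with the arrival rate, which is the efficiency effect the abstract refers to: higher utilization yields fuller batches and hence lower per-job processing cost.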


Related articles

Dynamic server allocation for unstable queueing networks with flexible servers

This paper is concerned with the dynamic assignment of servers to tasks in queueing networks where demand may exceed the capacity for service. The objective is to maximize the system throughput. We use fluid limit analysis to show that several quantities of interest, namely the maximum possible throughput, the maximum throughput for a given arrival rate, the minimum arrival rate that will yield...


Queueing Systems with Synergistic Servers

We consider tandem lines with finite buffers and flexible, heterogeneous servers who are synergistic in that they work more effectively in teams than on their own. Our objective is to determine how the servers should be assigned dynamically to tasks in order to maximize the long-run average throughput. In particular, we investigate when it is better to take advantage of synergy among servers, r...


Deterministic Analysis of Queueing Systems with Heterogeneous Servers

Using deterministic (sample-path) analysis, we generalize and extend fundamental properties of systems with “stationary deterministic flows” as introduced by Gelenbe (1983) and Gelenbe and Finkel (1987). Primarily, we provide conditions for stability and instability for general queueing models, and focus attention on multichannel queueing systems with servers that work at different rates. Stabi...


A Queueing System with Auxiliary Servers*

We examine a queueing system with multiple primary servers and a smaller number of auxiliary servers. There are two classes of customers: those who require service from a primary server working alone and those who require service from a primary server who is assisted by an auxiliary server. Though the apparent Markovian state space is five-dimensional, we show that an aggregation results in an exac...


Low Latency RNN Inference with Cellular Batching

Performing inference on pre-trained neural network models must meet the requirement of low latency, which is often at odds with achieving high throughput. Existing deep learning systems use batching to improve throughput, but this does not perform well when serving Recurrent Neural Networks with dynamic dataflow graphs. We propose the technique of cellular batching, which improves both the latency a...
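To convey the general idea, the following loose sketch (based only on the abstract above, not on the paper's system) pictures batching at the granularity of recurrent cells as a scheduler loop: every active request contributes its next cell computation to the current batch, so requests at different timesteps share one batched cell execution and newly arrived requests join immediately instead of waiting for a whole-request batch to drain. The `Request` fields, the `run_cell_batch` placeholder, and the `max_batch` cap are hypothetical.

```python
# Loose sketch of cell-granularity batching (assumptions, not the paper's
# implementation): each scheduling step batches the *next* recurrent cell of
# every active request.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    length: int           # number of recurrent steps this request needs
    step: int = 0         # next cell/timestep to execute
    state: object = None  # hypothetical recurrent hidden state

def run_cell_batch(batch):
    # Placeholder for one batched RNN-cell invocation on the accelerator.
    for req in batch:
        req.state = (req.rid, req.step)  # dummy hidden-state update
        req.step += 1

def cellular_batching(incoming, max_batch=4):
    """incoming: deque of Request objects in arrival order."""
    active, finished = deque(), []
    while incoming or active:
        # Admit new requests into the very next cell-level batch.
        while incoming and len(active) < max_batch:
            active.append(incoming.popleft())
        run_cell_batch(list(active))
        # Retire requests whose sequences are exhausted, freeing batch slots.
        still_active = deque()
        for req in active:
            (still_active if req.step < req.length else finished).append(req)
        active = still_active
    return finished

if __name__ == "__main__":
    reqs = deque(Request(rid=i, length=(i % 3) + 1) for i in range(6))
    done = cellular_batching(reqs)
    print([(r.rid, r.step) for r in done])  # requests finish as soon as ready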



Journal

Journal title: Performance Evaluation

Year: 2021

ISSN: 0166-5316, 1872-745X

DOI: https://doi.org/10.1016/j.peva.2020.102183